Loading the data and the necessary libraries:

library(readr)
library(lawstat)
library(tidyverse)
library(highcharter)
library(reshape2)
library(scales)
library(car)
library(ggthemr)
library(Hmisc)
library(corrplot)
library(gridExtra)
train <- read.csv("/Users/deyapple/Documents/Courses/Term04/DA/house/train.csv", 
                  stringsAsFactors = FALSE) %>%
  mutate(houseAge = -1 * (YearBuilt - YrSold), 
         grAge = -1 * (GarageYrBlt - YrSold))
test <- read.csv("/Users/deyapple/Documents/Courses/Term04/DA/house/test.csv", 
                 stringsAsFactors = FALSE) %>%
  mutate(houseAge = -1 * (YearBuilt - YrSold), 
         grAge = -1 * (GarageYrBlt - YrSold))
ggthemr('earth', type = 'inner')

Q1:

Choosing Quantitative Variables

At this early stage, I have only considered quantitative variables; in later sections I will add a few qualitative predictors to the model as well. Some variables are coded as numbers but are qualitative in nature, such as the month sold (MoSold). I have removed these variables too.

quan.df <- train %>%
  select_if(is.numeric) %>%
  select(-YearBuilt, -YrSold, -GarageYrBlt, -MoSold, -MSSubClass, -Id)

Correlation Matrix

Next, I have constructed a correlation matrix and, using ggplot2, drawn a heatmap from it. (I have removed the code to keep the document tidy.)
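For reference, a minimal sketch of how such a heatmap can be built with reshape2 and ggplot2 (illustrated on the built-in mtcars data so it runs standalone; the actual analysis would apply the same steps to quan.df):

```r
library(reshape2)  # melt()
library(ggplot2)

# Correlation matrix of the numeric variables (mtcars stands in for quan.df)
cor.mat <- cor(mtcars, use = "pairwise.complete.obs")

# melt() flattens the square matrix into (Var1, Var2, value) rows for ggplot
cor.melt <- melt(cor.mat)

# One tile per variable pair, filled by the correlation value
heat.p <- ggplot(cor.melt, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```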

A clearer visualization:

Correlation test

Next, I have made a data frame consisting of the correlations and their \(p-values\) under the null hypothesis that the true correlation is zero.
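The flattenCorrMatrix() helper used below is not defined in the code shown here; a common definition, along with how the correlation and p-value matrices come out of Hmisc::rcorr(), is sketched below (on mtcars so it runs standalone):

```r
library(Hmisc)  # rcorr()

# Flatten the square correlation and p-value matrices into one long
# data frame with one row per variable pair
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)  # keep each pair only once
  data.frame(
    row    = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor    = cormat[ut],
    p      = pmat[ut]
  )
}

# rcorr() returns $r (correlations) and $P (p-values under H0: rho = 0)
res <- rcorr(as.matrix(mtcars))
flat <- flattenCorrMatrix(res$r, res$P)
head(flat)
```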

pvalues <- quan.cor_and_p$P
# assuming quan.cor_and_p was built with Hmisc::rcorr(), which stores
# the correlations in $r and the p-values in $P
correlations <- quan.cor_and_p$r

cor_and_p <- flattenCorrMatrix(correlations, pvalues)

cor_and_p %>%
  filter(cor < 1) %>%
  filter(column == "SalePrice") %>%
  arrange(desc(cor)) %>%
  head(50)
             row    column         cor            p
1    OverallQual SalePrice  0.79098159 0.000000e+00
2      GrLivArea SalePrice  0.70862448 0.000000e+00
3     GarageCars SalePrice  0.64040917 0.000000e+00
4     GarageArea SalePrice  0.62343144 0.000000e+00
5    TotalBsmtSF SalePrice  0.61358052 0.000000e+00
6      X1stFlrSF SalePrice  0.60585219 0.000000e+00
7       FullBath SalePrice  0.56066376 0.000000e+00
8   TotRmsAbvGrd SalePrice  0.53372318 0.000000e+00
9   YearRemodAdd SalePrice  0.50710094 0.000000e+00
10    MasVnrArea SalePrice  0.47749305 0.000000e+00
11    Fireplaces SalePrice  0.46692884 0.000000e+00
12    BsmtFinSF1 SalePrice  0.38641980 0.000000e+00
13   LotFrontage SalePrice  0.35179910 0.000000e+00
14    WoodDeckSF SalePrice  0.32441345 0.000000e+00
15     X2ndFlrSF SalePrice  0.31933379 0.000000e+00
16   OpenPorchSF SalePrice  0.31585622 0.000000e+00
17      HalfBath SalePrice  0.28410769 0.000000e+00
18       LotArea SalePrice  0.26384336 0.000000e+00
19  BsmtFullBath SalePrice  0.22712223 0.000000e+00
20     BsmtUnfSF SalePrice  0.21447910 0.000000e+00
21  BedroomAbvGr SalePrice  0.16821316 9.927481e-11
22   ScreenPorch SalePrice  0.11144657 1.972139e-05
23      PoolArea SalePrice  0.09240355 4.073492e-04
24    X3SsnPorch SalePrice  0.04458367 8.858169e-02
25    BsmtFinSF2 SalePrice -0.01137812 6.639986e-01
26  BsmtHalfBath SalePrice -0.01684415 5.201537e-01
27       MiscVal SalePrice -0.02118958 4.184863e-01
28  LowQualFinSF SalePrice -0.02560613 3.282073e-01
29   OverallCond SalePrice -0.07785589 2.912352e-03
30 EnclosedPorch SalePrice -0.12857796 8.255763e-07
31  KitchenAbvGr SalePrice -0.13590737 1.860428e-07

Top 10 Highest Correlations

Finally, I have selected the variables with the highest correlations. For now, these will be our predictors.

top10 <- quan.cor.melt %>%
  filter(Var1 == "SalePrice") %>%
  filter(Var2 != "SalePrice") %>%
  arrange(desc(value)) %>%
  head(10)

In the data frame from the last section, if we look up the \(p-values\) of these variables, we see that they are very low, leading us to reject the null hypothesis that there is no correlation between SalePrice and each variable.
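The same null hypothesis can be tested for a single pair of variables with cor.test(); a small illustration on mtcars:

```r
# H0: the true correlation between weight and fuel economy is zero
ct <- cor.test(mtcars$wt, mtcars$mpg)
ct$estimate  # sample correlation (strongly negative for wt vs. mpg)
ct$p.value   # far below 0.05, so H0 is rejected
```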

Q2:

Drawing The Scatter Plot

Selecting the top 10 predictors from the data frame:

quan.predictors <- top10 %>%
  select(Var2) %>%
  unname() %>%
  unlist() %>%
  as.character()
var.nm <- c(quan.predictors, c("SalePrice"))
quan.df.sel <- quan.df %>%
  select(one_of(var.nm))

Drawing the pairwise scatter plot matrix:

scatterplotMatrix(quan.df.sel)

Determining Collinearity

Assembling the correlation matrix:

(Again, I haven't included the code in order to keep the document tidy.)

As we can see from the correlation matrix and the bottom row of the scatter plot matrix, these pairs of variables are collinear:
  • X1stFlrSF and TotalBsmtSF
  • GrLivArea and TotRmsAbvGrd
  • GarageArea and GarageCars

It makes sense for these variables to be collinear: even without any statistical knowledge, it is intuitively obvious that an increase in GarageCars would come with an increase in GarageArea. Later on, if any of these predictors remain in the model, I will use the vif() function to determine whether they are collinear or not.

    Q3:

    quan.df.reg <- lm(SalePrice ~ GarageArea + YearRemodAdd +
                        GarageCars + TotRmsAbvGrd + FullBath + 
                        GrLivArea + X1stFlrSF + 
                        TotalBsmtSF + OverallQual + MasVnrArea, data = quan.df, 
                      na.action = na.exclude)
    quan.df.reg.sum <- summary(quan.df.reg)
    quan.df.reg.sum
    
    Call:
    lm(formula = SalePrice ~ GarageArea + YearRemodAdd + GarageCars + 
        TotRmsAbvGrd + FullBath + GrLivArea + X1stFlrSF + TotalBsmtSF + 
        OverallQual + MasVnrArea, data = quan.df, na.action = na.exclude)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -481370  -18855   -1092   16075  295028 
    
    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  -9.311e+05  1.165e+05  -7.995 2.63e-15 ***
    GarageArea    1.235e+01  1.029e+01   1.200  0.23024    
    YearRemodAdd  4.330e+02  6.005e+01   7.210 9.00e-13 ***
    GarageCars    1.237e+04  2.998e+03   4.126 3.90e-05 ***
    TotRmsAbvGrd -2.601e+02  1.113e+03  -0.234  0.81525    
    FullBath     -1.813e+03  2.552e+03  -0.710  0.47755    
    GrLivArea     4.387e+01  4.156e+00  10.557  < 2e-16 ***
    X1stFlrSF     1.287e+01  4.915e+00   2.617  0.00896 ** 
    TotalBsmtSF   2.134e+01  4.238e+00   5.036 5.36e-07 ***
    OverallQual   1.992e+04  1.167e+03  17.069  < 2e-16 ***
    MasVnrArea    3.774e+01  6.272e+00   6.018 2.23e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 37760 on 1441 degrees of freedom
      (8 observations deleted due to missingness)
    Multiple R-squared:  0.7747,    Adjusted R-squared:  0.7732 
    F-statistic: 495.6 on 10 and 1441 DF,  p-value: < 2.2e-16

    Q4:

    Performing the Prediction:

    quan.df.pred <- quan.df %>%
      mutate(SalePricePred = fitted(quan.df.reg, type = "response"), 
             SalePriceError = residuals.lm(quan.df.reg)) 

    Predicted Price Vs. Actual Price:

    one.pa <- ggplot(quan.df.pred, aes(x = SalePrice, y = SalePricePred)) + 
      geom_point() + 
      geom_smooth() 
    one.pa

    quan.df.predh <- quan.df.pred %>%
      filter(!is.na(SalePrice)) %>%
      filter(!is.na(SalePricePred))
    hchart(type = "area", density(quan.df.predh$SalePrice), name = "Actual") %>%
      hc_add_series(type = "area", data = density(quan.df.predh$SalePricePred), name = "Prediction") %>%
      hc_add_theme(hc_theme_db())

    As we can see, the predicted price and the actual price are very close in most instances; that is, the scatter plot resembles the line \(y = x\) (the bisector of the first and third quadrants).

    Error Density Plot:

    one.ed <- ggplot(quan.df.pred) + 
      geom_histogram(aes(y = ..density.., x = SalePriceError), alpha = 0.5, bins = 150) + 
      geom_density(aes(x = SalePriceError), size = 0.5) 
    
    one.ed

    As we can see, the error chart is a slightly right-skewed bell shape, indicating an approximately normal distribution for the error.

    Q5:

    The RSE (Residual Standard Error) of the model is computed as \(RSE = \sqrt{\frac{RSS}{n - p - 1}}\) (which reduces to \(\sqrt{\frac{RSS}{n - 2}}\) in simple regression with one predictor), where RSS is the Residual Sum of Squares of the model. Since RSE is measured on the scale of Y (the response variable), it does not by itself give a clear view of whether the model is a good fit. To test whether the model fits the data well, we use the \(R^2\) value, computed as \(R^2 = \frac{TSS - RSS}{TSS}\), where \(TSS = \Sigma{(y_i - \bar{y})^2}\) is the Total Sum of Squares. TSS measures the total variance in Y before the regression is performed, while RSS measures the amount of variability left unexplained after the regression is performed. Therefore \(R^2\) is a measure between 0 and 1 indicating how much of the variability in the response is explained by our regression model. It differs from RSE in that it is always between 0 and 1, and therefore independent of the scale of Y.
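    As a sanity check, these quantities can be recomputed by hand from a fitted model; a sketch on mtcars (any lm object works the same way):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)                   # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
n <- nrow(mtcars)
p <- length(coef(fit)) - 1                     # number of predictors

rse <- sqrt(rss / (n - p - 1))
r2  <- (tss - rss) / tss

# Both match what summary(fit) reports
all.equal(rse, summary(fit)$sigma)
all.equal(r2, summary(fit)$r.squared)
```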

    quan.df.reg.sum$r.squared
    [1] 0.774743

    As we can see, the \(R^2\) value is relatively high, therefore most of the variance in the response is explained by our regression model.

    The \(F-statistic\) is our answer to the following question: is there a relationship between the response and at least one of our predictors? In other words, we want to test the null hypothesis \[H_0 : \beta_1 = \beta_2 = ... = \beta_p = 0\] versus the alternative hypothesis \[H_a: {at~least~one~\beta_j~is~non-zero}\] To perform this test we must compute the \(F-statistic\), \[F = \frac{\frac{TSS - RSS}{p}}{\frac{RSS}{n - p - 1}}\] where \(p\) is the number of predictors and \(n\) is the sample size. Based on an F-distribution table, we can find the minimum value of F required to reject the null hypothesis at different values of \(\alpha\).
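    The reported \(F-statistic\) can be verified by hand the same way (again sketched on mtcars):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
n <- nrow(mtcars)
p <- length(coef(fit)) - 1

# F-statistic from the formula above
f <- ((tss - rss) / p) / (rss / (n - p - 1))
all.equal(f, unname(summary(fit)$fstatistic["value"]))

# Its p-value from the F distribution with (p, n - p - 1) df
pf(f, p, n - p - 1, lower.tail = FALSE)
```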

    quan.df.reg.sum$fstatistic
        value     numdf     dendf 
     495.6137   10.0000 1441.0000 

    As we can see from the relatively large value of the \(F-statistic\) and the corresponding low \(p-value\), we can reject the null hypothesis that there is no relationship between the response and any of the predictors.

    Q6:

    quan.df.reg.sum
    
    Call:
    lm(formula = SalePrice ~ GarageArea + YearRemodAdd + GarageCars + 
        TotRmsAbvGrd + FullBath + GrLivArea + X1stFlrSF + TotalBsmtSF + 
        OverallQual + MasVnrArea, data = quan.df, na.action = na.exclude)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -481370  -18855   -1092   16075  295028 
    
    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  -9.311e+05  1.165e+05  -7.995 2.63e-15 ***
    GarageArea    1.235e+01  1.029e+01   1.200  0.23024    
    YearRemodAdd  4.330e+02  6.005e+01   7.210 9.00e-13 ***
    GarageCars    1.237e+04  2.998e+03   4.126 3.90e-05 ***
    TotRmsAbvGrd -2.601e+02  1.113e+03  -0.234  0.81525    
    FullBath     -1.813e+03  2.552e+03  -0.710  0.47755    
    GrLivArea     4.387e+01  4.156e+00  10.557  < 2e-16 ***
    X1stFlrSF     1.287e+01  4.915e+00   2.617  0.00896 ** 
    TotalBsmtSF   2.134e+01  4.238e+00   5.036 5.36e-07 ***
    OverallQual   1.992e+04  1.167e+03  17.069  < 2e-16 ***
    MasVnrArea    3.774e+01  6.272e+00   6.018 2.23e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 37760 on 1441 degrees of freedom
      (8 observations deleted due to missingness)
    Multiple R-squared:  0.7747,    Adjusted R-squared:  0.7732 
    F-statistic: 495.6 on 10 and 1441 DF,  p-value: < 2.2e-16

    In order to select the variables that truly have an effect on the response, I have used the backward selection method. It works as follows: we start with all the predictors in the model and omit the one with the highest \(p-value\), i.e. the least statistically significant predictor. We then fit the new model using the remaining \(p - 1\) predictors and repeat. We keep deleting predictors with high \(p-values\) until every remaining predictor has a \(p-value\) low enough to be statistically significant.
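    A sketch of this procedure as a loop (illustrated on mtcars rather than the housing data; drop1() provides the per-predictor partial F-test \(p-values\)):

```r
# Backward elimination by p-value: repeatedly refit after dropping the
# least significant predictor, until everything left is significant
backward_p <- function(fit, alpha = 0.05) {
  repeat {
    tab <- drop1(fit, test = "F")   # partial F test for each predictor
    pvals <- tab[["Pr(>F)"]][-1]    # drop the "<none>" row
    if (length(pvals) == 0 || max(pvals) <= alpha) break
    victim <- rownames(tab)[-1][which.max(pvals)]
    fit <- update(fit, as.formula(paste(". ~ . -", victim)))
  }
  fit
}

full <- lm(mpg ~ ., data = mtcars)
reduced <- backward_p(full)
formula(reduced)  # only predictors significant at alpha remain
```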

    Step 1: Removing TotRmsAbvGrd

    Re-fitting the model:

    quan.df.reg.mod1 <- update(quan.df.reg, . ~ . - TotRmsAbvGrd, na.action = na.exclude)
    quan.df.reg.mod1.sum <- summary(quan.df.reg.mod1)
    
    quan.df.pred <- quan.df.pred %>%
      mutate(SalePricePredMod1 = fitted(quan.df.reg.mod1, type = "response"), 
             SalePriceErrorMod1 = residuals.lm(quan.df.reg.mod1))

    Drawing plots:

    two.pa <- ggplot(quan.df.pred, aes(x = SalePrice, y = SalePricePredMod1)) + 
      geom_point() + 
      geom_smooth() 
    two.pa

    quan.df.predh <- quan.df.pred %>%
      filter(!is.na(SalePrice)) %>%
      filter(!is.na(SalePricePredMod1))
    hchart(type = "area", density(quan.df.predh$SalePrice), name = "Actual") %>%
      hc_add_series(type = "area", data = density(quan.df.predh$SalePricePredMod1), name = "Prediction 1") %>%
      hc_add_theme(hc_theme_db())
    two.ed <- ggplot(quan.df.pred) + 
      geom_histogram(aes(y = ..density.., x = SalePriceErrorMod1), alpha = 0.5, bins = 150) + 
      geom_density(aes(x = SalePriceErrorMod1), size = 0.5) 
    two.ed

    Calculating \(R^2\) and the \(F-statistic\):

    quan.df.reg.mod1.sum$r.squared
    [1] 0.7747344
    quan.df.reg.mod1.sum$fstatistic
       value    numdf    dendf 
     551.037    9.000 1442.000 

    Checking whether there are predictors to delete in the next round:

    quan.df.reg.mod1.sum
    
    Call:
    lm(formula = SalePrice ~ GarageArea + YearRemodAdd + GarageCars + 
        FullBath + GrLivArea + X1stFlrSF + TotalBsmtSF + OverallQual + 
        MasVnrArea, data = quan.df, na.action = na.exclude)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -480566  -18728   -1121   16064  296091 
    
    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  -9.333e+05  1.160e+05  -8.044 1.80e-15 ***
    GarageArea    1.252e+01  1.026e+01   1.220  0.22276    
    YearRemodAdd  4.337e+02  5.995e+01   7.235 7.52e-13 ***
    GarageCars    1.232e+04  2.990e+03   4.122 3.98e-05 ***
    FullBath     -1.881e+03  2.535e+03  -0.742  0.45826    
    GrLivArea     4.318e+01  2.923e+00  14.773  < 2e-16 ***
    X1stFlrSF     1.285e+01  4.914e+00   2.616  0.00899 ** 
    TotalBsmtSF   2.143e+01  4.222e+00   5.074 4.40e-07 ***
    OverallQual   1.994e+04  1.163e+03  17.138  < 2e-16 ***
    MasVnrArea    3.779e+01  6.267e+00   6.030 2.08e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 37750 on 1442 degrees of freedom
      (8 observations deleted due to missingness)
    Multiple R-squared:  0.7747,    Adjusted R-squared:  0.7733 
    F-statistic:   551 on 9 and 1442 DF,  p-value: < 2.2e-16

    Step 2: Removing FullBath

    Re-fitting the model:

    quan.df.reg.mod2 <- update(quan.df.reg.mod1, . ~ . - FullBath, na.action = na.exclude)
    quan.df.reg.mod2.sum <- summary(quan.df.reg.mod2)
    
    quan.df.pred <- quan.df.pred %>%
      mutate(SalePricePredMod2 = fitted(quan.df.reg.mod2, type = "response"), 
             SalePriceErrorMod2 = residuals.lm(quan.df.reg.mod2))

    Drawing plots:

    three.pa <- ggplot(quan.df.pred, aes(x = SalePrice, y = SalePricePredMod2)) + 
      geom_point() + 
      geom_smooth()
    three.pa

    quan.df.predh <- quan.df.pred %>%
      filter(!is.na(SalePrice)) %>%
      filter(!is.na(SalePricePredMod2))
    hchart(type = "area", density(quan.df.predh$SalePrice), name = "Actual") %>%
      hc_add_series(type = "area", data = density(quan.df.predh$SalePricePredMod2), name = "Prediction 2") %>%
      hc_add_theme(hc_theme_db())
    three.ed <- ggplot(quan.df.pred) + 
      geom_histogram(aes(y = ..density.., x = SalePriceErrorMod2), alpha = 0.5, bins = 150) + 
      geom_density(aes(x = SalePriceErrorMod2), size = 0.5) 
    three.ed

    Calculating \(R^2\) and the \(F-statistic\):

    quan.df.reg.mod2.sum$r.squared
    [1] 0.7746484
    quan.df.reg.mod2.sum$fstatistic
        value     numdf     dendf 
     620.0411    8.0000 1443.0000 

    Checking whether there are predictors to delete in the next round:

    quan.df.reg.mod2.sum
    
    Call:
    lm(formula = SalePrice ~ GarageArea + YearRemodAdd + GarageCars + 
        GrLivArea + X1stFlrSF + TotalBsmtSF + OverallQual + MasVnrArea, 
        data = quan.df, na.action = na.exclude)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -478209  -18973   -1206   15861  296677 
    
    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  -9.138e+05  1.130e+05  -8.087 1.28e-15 ***
    GarageArea    1.327e+01  1.021e+01   1.299  0.19406    
    YearRemodAdd  4.235e+02  5.834e+01   7.260 6.31e-13 ***
    GarageCars    1.198e+04  2.954e+03   4.056 5.25e-05 ***
    GrLivArea     4.225e+01  2.642e+00  15.995  < 2e-16 ***
    X1stFlrSF     1.276e+01  4.911e+00   2.598  0.00948 ** 
    TotalBsmtSF   2.161e+01  4.214e+00   5.127 3.34e-07 ***
    OverallQual   1.984e+04  1.156e+03  17.167  < 2e-16 ***
    MasVnrArea    3.786e+01  6.265e+00   6.044 1.91e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 37740 on 1443 degrees of freedom
      (8 observations deleted due to missingness)
    Multiple R-squared:  0.7746,    Adjusted R-squared:  0.7734 
    F-statistic:   620 on 8 and 1443 DF,  p-value: < 2.2e-16

    Step 3: Removing GarageArea

    Re-fitting the model:

    quan.df.reg.mod3 <- update(quan.df.reg.mod2, . ~ . - GarageArea, 
                               na.action = na.exclude)
    quan.df.reg.mod3.sum <- summary(quan.df.reg.mod3)
    
    quan.df.pred <- quan.df.pred %>%
      mutate(SalePricePredMod3 = fitted(quan.df.reg.mod3, type = "response"), 
             SalePriceErrorMod3 = residuals.lm(quan.df.reg.mod3))

    Drawing plots:

    four.pa <- ggplot(quan.df.pred, aes(x = SalePrice, y = SalePricePredMod3)) + 
      geom_point() + 
      geom_smooth() 
    four.pa

    quan.df.predh <- quan.df.pred %>%
      filter(!is.na(SalePrice)) %>%
      filter(!is.na(SalePricePredMod3))
    hchart(type = "area", density(quan.df.predh$SalePrice), name = "Actual") %>%
      hc_add_series(type = "area", data = density(quan.df.predh$SalePricePredMod3), name = "Prediction 3") %>%
      hc_add_theme(hc_theme_db())
    four.ed <- ggplot(quan.df.pred) + 
      geom_histogram(aes(y = ..density.., x = SalePriceErrorMod3), alpha = 0.5, bins = 150) + 
      geom_density(aes(x = SalePriceErrorMod3), size = 0.5) 
    four.ed

    Calculating \(R^2\) and the \(F-statistic\):

    quan.df.reg.mod3.sum$r.squared
    [1] 0.7743848
    quan.df.reg.mod3.sum$fstatistic
        value     numdf     dendf 
     708.0398    7.0000 1444.0000 

    Checking whether there are predictors to delete in the next round:

    quan.df.reg.mod3.sum
    
    Call:
    lm(formula = SalePrice ~ YearRemodAdd + GarageCars + GrLivArea + 
        X1stFlrSF + TotalBsmtSF + OverallQual + MasVnrArea, data = quan.df, 
        na.action = na.exclude)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -471039  -18730   -1136   16449  296004 
    
    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  -9.111e+05  1.130e+05  -8.062 1.56e-15 ***
    YearRemodAdd  4.220e+02  5.834e+01   7.234 7.60e-13 ***
    GarageCars    1.509e+04  1.736e+03   8.691  < 2e-16 ***
    GrLivArea     4.238e+01  2.641e+00  16.048  < 2e-16 ***
    X1stFlrSF     1.317e+01  4.902e+00   2.687  0.00729 ** 
    TotalBsmtSF   2.211e+01  4.198e+00   5.268 1.59e-07 ***
    OverallQual   1.980e+04  1.156e+03  17.136  < 2e-16 ***
    MasVnrArea    3.830e+01  6.257e+00   6.121 1.20e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 37750 on 1444 degrees of freedom
      (8 observations deleted due to missingness)
    Multiple R-squared:  0.7744,    Adjusted R-squared:  0.7733 
    F-statistic:   708 on 7 and 1444 DF,  p-value: < 2.2e-16

    Since all of the remaining \(p-values\) are less than 0.05, we can reject the null hypothesis \(H_0: \beta_j = 0\) for each predictor at a significance level of \(\alpha = 0.05\); therefore we stop here and remove no further predictors at this stage.

    Quick Visual Comparison

    Predicted Vs. Actual:

    grid.arrange(one.pa, two.pa, three.pa, four.pa, nrow = 2)

    Error Density Plots:

    grid.arrange(one.ed, two.ed, three.ed, four.ed, nrow = 2)

    In order to see whether the remaining predictors are collinear (as I suggested in Question 2), I have used the vif() function. The closer the output is to 1, the less a predictor is correlated with the others.

    vif(quan.df.reg.mod3)
    YearRemodAdd   GarageCars    GrLivArea    X1stFlrSF  TotalBsmtSF 
        1.478150     1.718849     1.961472     3.626596     3.443565 
     OverallQual   MasVnrArea 
        2.594410     1.306956 

    Since none of the values are particularly high, I won't change the model:

    quan.df.fit <- quan.df.reg.mod3

    Q7

    Constant Variance

    The Residuals vs. Fitted plot on the top left and the Scale-Location plot (standardized residuals vs. fitted values) on the bottom left are good checks for this assumption. The plots show that the residuals differ somewhat in variance: if the residuals had constant variance, there would be no visible pattern in the red line and it would be roughly horizontal. As we can see, this is clearly not the case.

    par(mfrow=c(2,2))
    plot(quan.df.fit)

    Normality

    This plot (similar to the top-right chart in the previous section) shows whether our residuals have a normal distribution or not. The closer the points are to the dashed line, the more normal the distribution of the residuals. As we can see, the residuals drift away from the dashed line towards the tails. To fix this, we can transform the response with sqrt() or log().

    car::qqPlot(quan.df.fit, id.method="identify",
                simulate = TRUE, main="Q-Q Plot")

    using sqrt():

    quan.df.fit.sqrt <- update(quan.df.fit, sqrt(SalePrice) ~ ., 
                               na.action = na.exclude)
    car::qqPlot(quan.df.fit.sqrt, id.method="identify",
                simulate = TRUE, main="Q-Q Plot")

    using log():

    quan.df.fit.log <- update(quan.df.fit, log(SalePrice) ~ ., 
                               na.action = na.exclude)
    
    car::qqPlot(quan.df.fit.log, id.method="identify",
                simulate = TRUE, main="Q-Q Plot")

    The log() transformation works slightly better than sqrt() in terms of normality.
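    A formal complement to the visual check is the Shapiro-Wilk test (shapiro.test() handles up to 5000 observations). A small synthetic illustration of why log() can beat sqrt() when the errors are multiplicative:

```r
set.seed(1)
x <- runif(500)
# Response with multiplicative (lognormal) errors: log(y) ~ x has
# exactly normal errors by construction
y <- exp(1 + 2 * x + rnorm(500, sd = 0.3))

# Shapiro-Wilk p-values under H0: the residuals are normal
p_raw  <- shapiro.test(residuals(lm(y ~ x)))$p.value
p_sqrt <- shapiro.test(residuals(lm(sqrt(y) ~ x)))$p.value
p_log  <- shapiro.test(residuals(lm(log(y) ~ x)))$p.value

c(raw = p_raw, sqrt = p_sqrt, log = p_log)
```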

    Independence

    In other words, we would like to check whether the residuals are auto-correlated. When the residuals are auto-correlated, the current value depends on the previous value. There are a few ways to determine whether the residuals are dependent (auto-correlated) or not:

    1. Using the ACF plot

    acf(quan.df.fit.log$residuals)

    The first spike shows the auto-correlation of the residuals with themselves (lag zero), so it is always equal to one. As we can see, the subsequent spikes drop below the blue dashed line (the significance bound), from which we can conclude that the residuals are not auto-correlated (they are independent).

    2. Using the runs test

    In the runs test, the null hypothesis is that the residuals are random (a value is not affected by its previous value).

    lawstat::runs.test(quan.df.fit.log$residuals)
    
        Runs Test - Two sided
    
    data:  quan.df.fit.log$residuals
    Standardized Runs Statistic = -0.21002, p-value = 0.8337

    Since the \(p-value\) attained from this test is very high, we fail to reject the null hypothesis.

    3. Using the Durbin-Watson test

    lmtest::dwtest(quan.df.fit.log)
    
        Durbin-Watson test
    
    data:  quan.df.fit.log
    DW = 1.9978, p-value = 0.4831
    alternative hypothesis: true autocorrelation is greater than 0

    All three methods support the assumption of independence of the residuals.

    Q8:

    Dividing the train dataset into training and testing parts:

    samp_size <- floor(0.8 * nrow(train))
    train_ind <- sample(seq_len(nrow(train)), size = samp_size)
    train.tmp <- train[train_ind, ] 
    test.tmp <- train[-train_ind, ]
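    One caveat: without a fixed RNG seed this split (and the error computed below) changes on every run. A reproducible version of the same split, sketched on mtcars:

```r
set.seed(42)  # fix the RNG state so the split is reproducible

samp_size <- floor(0.8 * nrow(mtcars))
train_ind <- sample(seq_len(nrow(mtcars)), size = samp_size)
train.tmp <- mtcars[train_ind, ]   # 80% of the rows
test.tmp  <- mtcars[-train_ind, ]  # the remaining 20%

c(nrow(train.tmp), nrow(test.tmp))
```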

    fitting the model :

    quan.df.reg.tmp <- lm(log(SalePrice) ~ YearRemodAdd + GarageCars
                          + GrLivArea + X1stFlrSF + TotalBsmtSF + 
                            OverallQual + MasVnrArea, data = train.tmp)

    Drawing the plot :

    test.tmp <- test.tmp %>%
      mutate(SalePricePred = exp(predict(quan.df.reg.tmp, 
                                     type = "response", 
                                     newdata = test.tmp)), 
             SalePriceError = (SalePrice - SalePricePred) ^ 2)
    
    ggplot(test.tmp) + 
      geom_point(aes(x = SalePrice, y = SalePricePred))

    ggplot(test.tmp) + 
      geom_histogram(aes(y = ..density.., x = SalePriceError), alpha = 0.5, bins = 150) + 
      geom_density(aes(x = SalePriceError), size = 0.5) 

    Determining the error of our prediction:

    In order to measure the prediction error, I have calculated the Root Mean Squared Error, \[RMSE = \sqrt{\frac{1}{n}\Sigma(\hat{Y_i} - Y_i)^2} \] (computed below in the variable MSE).

    MSE <- sqrt(mean(test.tmp$SalePriceError, na.rm = T))
    MSE
    [1] 108427.1

    Q9:

    Based on the plots drawn in Question 2, I have drawn scatter plots of Sale Price based on predictors that seemed to have a non-linear relationship, which are:
  • OverallQual
  • GarageCars
  • GarageArea (removed due to its high p-value in Question 6)
  • TotalBsmtSF
  • X1stFlrSF
  • Fireplaces

    OverallQual

    Based on the chart below, it seems that SalePrice is a degree-3 polynomial function of OverallQual.

    ggplot(quan.df, aes(x = OverallQual, y = SalePrice))  +
      geom_count() + 
      geom_smooth() 

    GarageCars

    SalePrice seems to be a degree-2 polynomial function of GarageCars.

    ggplot(quan.df, aes(x = GarageCars, y = SalePrice)) +
      geom_count() 

    X1stFlrSF

    X1stFlrSF seems to have a linear relationship with SalePrice up to values of about 3000; the observations beyond that can be considered high-leverage points. Leverage points can be dangerous in model fitting, therefore we can delete them from the training set.

    ggplot(quan.df, aes(x = X1stFlrSF, y = SalePrice)) + 
      geom_point() + 
      geom_smooth()

    After removing the high-leverage points:

    train <- train %>%
      filter(X1stFlrSF < 3000)
    ggplot(train, aes(x = X1stFlrSF, y = SalePrice)) + 
      geom_point() + 
      geom_smooth()
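    Rather than an eyeballed cutoff, high-leverage observations can also be flagged with hat values; a common rule of thumb flags points with \(h_i > 2(p+1)/n\) (sketched on mtcars):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)         # leverage of each observation
n <- nrow(mtcars)
p <- length(coef(fit)) - 1  # number of predictors

cutoff <- 2 * (p + 1) / n   # rule-of-thumb leverage threshold
which(h > cutoff)           # observations flagged as high leverage
```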

    Q10:

    The Final Model

    Based on the calculations above, I have made a few adjustments to the model. I have also added a few categorical predictors that seemed to have a strong effect on the response. Lastly, after reading the data set and the description of each column, I have added a few extra predictors that intuitively seemed to affect the price, such as:
  • OverallCond
  • houseAge
  • YearBuilt
  • MSZoning
  • MSSubClass
  • CentralAir

    quan.df.final <- lm(log(SalePrice) ~  
                          poly(OverallQual, 3) + poly(OverallCond, 3) +
                          X1stFlrSF +
                          poly(GarageCars, 2) + 
                          TotalBsmtSF + GrLivArea + houseAge +
                          YearRemodAdd + YearBuilt + MSZoning + MSSubClass +
                          CentralAir, 
                        data = train,
                        na.action = na.exclude)
    summary(quan.df.final)
    
    Call:
    lm(formula = log(SalePrice) ~ poly(OverallQual, 3) + poly(OverallCond, 
        3) + X1stFlrSF + poly(GarageCars, 2) + TotalBsmtSF + GrLivArea + 
        houseAge + YearRemodAdd + YearBuilt + MSZoning + MSSubClass + 
        CentralAir, data = train, na.action = na.exclude)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.81005 -0.07033 -0.00134  0.07714  0.55951 
    
    Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
    (Intercept)            1.374e+01  5.230e+00   2.627  0.00870 ** 
    poly(OverallQual, 3)1  4.104e+00  2.242e-01  18.304  < 2e-16 ***
    poly(OverallQual, 3)2  8.389e-01  1.535e-01   5.465 5.44e-08 ***
    poly(OverallQual, 3)3  1.940e-01  1.445e-01   1.342  0.17966    
    poly(OverallCond, 3)1  2.192e+00  1.705e-01  12.854  < 2e-16 ***
    poly(OverallCond, 3)2 -3.870e-01  1.480e-01  -2.614  0.00903 ** 
    poly(OverallCond, 3)3 -2.027e-01  1.427e-01  -1.421  0.15555    
    X1stFlrSF              1.583e-06  1.768e-05   0.090  0.92868    
    poly(GarageCars, 2)1   1.964e+00  1.839e-01  10.678  < 2e-16 ***
    poly(GarageCars, 2)2  -3.070e-01  1.472e-01  -2.086  0.03718 *  
    TotalBsmtSF            1.761e-04  1.549e-05  11.364  < 2e-16 ***
    GrLivArea              2.912e-04  1.001e-05  29.104  < 2e-16 ***
    houseAge              -4.740e-03  2.606e-03  -1.819  0.06916 .  
    YearRemodAdd           1.161e-03  2.460e-04   4.719 2.60e-06 ***
    YearBuilt             -2.475e-03  2.615e-03  -0.946  0.34420    
    MSZoningFV             3.939e-01  4.691e-02   8.395  < 2e-16 ***
    MSZoningRH             3.340e-01  5.390e-02   6.197 7.48e-10 ***
    MSZoningRL             3.729e-01  4.346e-02   8.580  < 2e-16 ***
    MSZoningRM             2.669e-01  4.368e-02   6.111 1.27e-09 ***
    MSSubClass            -2.197e-04  9.452e-05  -2.325  0.02023 *  
    CentralAirY            6.609e-02  1.670e-02   3.958 7.94e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.1313 on 1436 degrees of freedom
    Multiple R-squared:  0.8932,    Adjusted R-squared:  0.8918 
    F-statistic: 600.7 on 20 and 1436 DF,  p-value: < 2.2e-16

    Predicting the Sale Prices for the testing data:

    test <- test %>%
      mutate(SalePricePred = exp(predict(quan.df.final, 
                                         type = "response", 
                                         newdata = test)))

    Kaggle Submission File :

    result <- test %>%
      select(Id, SalePrice = SalePricePred)
    write.csv(result, file = "/Users/deyapple/Documents/result.csv", row.names = F)

    Kaggle Ranking (#2433):

    Since some of the predicted values were NA, I replaced them with the mean prediction by hand before submitting to Kaggle.
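    That by-hand replacement could also be scripted; a minimal sketch on a toy submission frame (the real result data frame has the same Id / SalePrice columns):

```r
# Toy stand-in for the Kaggle submission data frame
result <- data.frame(Id = 1:5,
                     SalePrice = c(100000, NA, 150000, NA, 200000))

# Replace missing predictions with the mean of the non-missing ones
result$SalePrice[is.na(result$SalePrice)] <-
  mean(result$SalePrice, na.rm = TRUE)

result$SalePrice  # the two NAs are now 150000
```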